[Day-11] The Stanza Pipeline for Traditional Chinese Dependency Parsing

14th Ironman Contest

jameshuang

2022-09-26 23:55:25

Day-11 contents

The Stanza pipeline for Traditional Chinese dependency parsing
- The Stanza pipeline with the GSD model
- What is Universal Dependencies
- About UD Chinese GSD

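The Stanza pipeline for Traditional Chinese dependency parsing

As mentioned yesterday in [Day-10] Methods for Traditional Chinese Dependency Parsing, for now I have chosen to use the model that Stanza trained on UD Chinese GSD to produce dependency parse structures for Traditional Chinese news articles. UD Chinese GSD is a Traditional Chinese Universal Dependencies treebank annotated and converted by Google.

Before analyzing how opinion sentences relate to their dependency parse structures, I want to understand Stanza's pipeline for the dependency parsing task. That way, when a parse looks unreasonable later on, I can check the output of each pipeline stage in turn and find the stage that went wrong.

The Stanza pipeline with the GSD model

The Stanza Dependency Parsing page explains that dependency parsing is carried out by a Python class named DepparseProcessor, and that running DepparseProcessor requires TokenizeProcessor, MWTProcessor, POSProcessor, and LemmaProcessor, Python classes that each handle a different task. In other words, a Stanza pipeline using the GSD model processes the input text in the following steps: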

Tokenization
- Splits a Traditional Chinese sentence into a sequence whose units are words.
Multi-Word Token (MWT) Expansion
- Traditional Chinese has no multi-word tokens, so this step does not exist in the Traditional Chinese dependency parsing pipeline.
Note: Only languages with multi-word tokens (MWT), such as German or French, require MWTProcessor; other languages, such as English or Chinese, do not support this processor in the pipeline.
Part-of-Speech & Morphological Features
- This step labels each word in the input string with its POS and feature tags (both are covered later in this post).
Lemmatization
- I still need to look into which lemma annotations UD Chinese GSD contains.

Dependency Parsing

This step marks the relations between words. The output looks like this:

id: 1	word: 媒體	head id: 17	head: 表示	deprel: nsubj
id: 2	word: 關注	head id: 17	head: 表示	deprel: advcl
id: 3	word: 數位	head id: 4	head: 部	deprel: compound
id: 4	word: 部	head id: 5	head: 部長	deprel: nmod
id: 5	word: 部長	head id: 13	head: 看法	deprel: nmod
id: 6	word: 唐	head id: 5	head: 部長	deprel: appos
id: 7	word: 鳳	head id: 6	head: 唐	deprel: flat:name
id: 8	word: 對	head id: 13	head: 看法	deprel: case
id: 9	word: 數位	head id: 13	head: 看法	deprel: nmod
id: 10	word: 中介	head id: 13	head: 看法	deprel: nmod
id: 11	word: 服務	head id: 12	head: 法	deprel: compound
id: 12	word: 法	head id: 13	head: 看法	deprel: nmod
id: 13	word: 看法	head id: 17	head: 表示	deprel: obl
...
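
Rows in the format above can be produced with a few lines of Python. The following is a minimal sketch, not the exact code used for this post: it assumes the `stanza` package is installed and that the default `zh-hant` models are the ones trained on UD Chinese GSD. `build_pipeline` and `dependency_rows` are illustrative names, and the two-word `demo` sentence is a hand-made stand-in (not real parser output) so the formatter can be tried without downloading a model.

```python
from types import SimpleNamespace

# Tokenize -> POS/features -> lemma -> dependency parse (no MWT step for Chinese).
PROCESSORS = "tokenize,pos,lemma,depparse"


def build_pipeline():
    """Build a Traditional Chinese pipeline; assumes stanza is installed."""
    import stanza  # imported lazily so the formatter below works without it

    stanza.download("zh-hant")  # fetches the default zh-hant models once
    return stanza.Pipeline(lang="zh-hant", processors=PROCESSORS)


def dependency_rows(sentence):
    """Format each word of one parsed sentence as an id/head/deprel row."""
    rows = []
    for word in sentence.words:
        # head == 0 marks the root; otherwise head is a 1-based word id.
        head_text = sentence.words[word.head - 1].text if word.head > 0 else "root"
        rows.append(
            f"id: {word.id}\tword: {word.text}\thead id: {word.head}"
            f"\thead: {head_text}\tdeprel: {word.deprel}"
        )
    return rows


# Hand-made stand-in for a parsed sentence (hypothetical, not parser output).
demo = SimpleNamespace(words=[
    SimpleNamespace(id=1, text="媒體", head=2, deprel="nsubj"),
    SimpleNamespace(id=2, text="表示", head=0, deprel="root"),
])
for row in dependency_rows(demo):
    print(row)
```

With a real model, the same formatter would be applied to each entry of `doc.sentences`, where `doc` is the result of calling the pipeline returned by `build_pipeline()` on a text string.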

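So what is Universal Dependencies, and what does UD Chinese GSD contain? The next two sections present the information I have gathered.

What is Universal Dependencies

This section draws on the Universal Dependencies site and its Introduction page.

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. Its goal is to facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective.

The annotation scheme is based on Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).

UD is an open community effort with over 300 contributors producing nearly 200 treebanks in over 100 languages.

UD is designed around the following six goals:

- UD needs to be satisfactory on linguistic analysis grounds for individual languages.
- UD needs to be good for linguistic typology, that is, it should provide a suitable basis for bringing out cross-linguistic parallelism across languages and language families.
- UD must be suitable for rapid, consistent annotation by a human annotator.
- UD must be easily comprehended and used by a non-linguist, whether a language learner or an engineer with prosaic needs for language processing.
- UD must be suitable for computer parsing with high accuracy.
- UD must support downstream language understanding tasks well (relation extraction, reading comprehension, machine translation, ...).

The UD annotation conventions are documented in the UD Guidelines.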

About UD Chinese GSD

This section draws on UD Chinese GSD: https://universaldependencies.org/treebanks/zh_gsd/index.html

UD Chinese GSD is a Traditional Chinese Universal Dependencies treebank annotated and converted by Google.

Tokenization and Word Segmentation

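Written Traditional Chinese does not separate words with spaces, so tokenizing by word rather than by character requires word segmentation first (for Chinese this is called 中文斷詞). Good Chinese word segmentation needs a large corpus and a good algorithm or model.

The corpus behind UD Chinese GSD contains: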
- 4997 sentences and 123291 tokens
- 122962 tokens (100%) that are not followed by a space
- (DOES NOT contain words with spaces)
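
Counts like these can be recomputed from a treebank's CoNLL-U file, where a token that is not followed by a space carries `SpaceAfter=No` in the MISC column (the tenth column). The sketch below is only an illustration of that idea: `conllu_stats` and `row` are hypothetical helper names, and `SAMPLE` is a tiny hand-written fragment, not data taken from UD Chinese GSD.

```python
def row(idx, form, upos, head, deprel, misc):
    """Build one 10-column CoNLL-U line (lemma set to the form, unused columns '_')."""
    return "\t".join(
        [str(idx), form, form, upos, "_", "_", str(head), deprel, "_", misc]
    )


# Hand-written illustrative fragment (not real treebank data).
SAMPLE = "\n".join([
    "# text = 我喜歡貓。",
    row(1, "我", "PRON", 2, "nsubj", "SpaceAfter=No"),
    row(2, "喜歡", "VERB", 0, "root", "SpaceAfter=No"),
    row(3, "貓", "NOUN", 2, "obj", "SpaceAfter=No"),
    row(4, "。", "PUNCT", 2, "punct", "SpaceAfter=No"),
    "",
    "# text = Google 地圖",
    row(1, "Google", "PROPN", 2, "nmod", "_"),
    row(2, "地圖", "NOUN", 0, "root", "SpaceAfter=No"),
    "",
])


def conllu_stats(text):
    """Return (sentences, tokens, tokens flagged SpaceAfter=No)."""
    sentences = tokens = no_space = 0
    in_sentence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip ranges such as 1-2 and empty nodes such as 1.1
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        misc = cols[9] if len(cols) > 9 else "_"
        if "SpaceAfter=No" in misc.split("|"):
            no_space += 1
    return sentences, tokens, no_space


print(conllu_stats(SAMPLE))
```

Running the same function over the treebank's actual `.conllu` files would be the way to check the published figures; the file names and locations are not covered here.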

In addition, UD Chinese GSD includes the following annotation layers:

POS Tags

A POS tag marks a word's part of speech.

UD Chinese GSD further divides POS tags into two kinds:
- UPOS: annotated manually in a non-UD style, then automatically converted to UD style
- XPOS: annotated manually

POS tag inventory:
ADJ – ADP – ADV – AUX – CCONJ – DET – NOUN – NUM – PART – PRON – PROPN – PUNCT – SYM – VERB – X

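Each of these POS tags is described in detail in Universal POS tags.

Universal POS tags label every verb with the single category VERB, whereas CKIP Transformers subdivides verbs further in its part-of-speech tagging task, with tags such as VE and VA. I will therefore use the two together later on.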

Relations

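A relation label annotates the relationship between two words in dependency parsing (usually drawn as a labeled arrow from one word to the other).

Relations in UD Chinese GSD were annotated manually in a non-UD style, then automatically converted to UD style.

Relation inventory: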
acl – acl:relcl – advcl – advmod – amod – appos – aux – aux:pass – case – cc – ccomp – clf – compound – compound:ext – conj – cop – csubj – csubj:pass – det – discourse – discourse:sp – dislocated – flat:foreign – flat:name – iobj – mark – mark:adv – mark:rel – nmod – nmod:tmod – nsubj – nsubj:pass – nummod – obj – obl – obl:patient – orphan – parataxis – punct – reparandum – root – vocative – xcomp

These relation labels matter a great deal for analyzing the structure of opinion sentences. They are described in detail in Universal Dependency Relations.
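
To make the head/dependent structure concrete, here is a small sketch that groups words by their head, using the thirteen rows from the Dependency Parsing sample output earlier in this post. `dependents_by_head` is an illustrative helper name, and this is only one possible way to inspect the relations, not a prescribed method.

```python
from collections import defaultdict

# (id, word, head id, deprel) copied from the sample output earlier in the post.
ROWS = [
    (1, "媒體", 17, "nsubj"),
    (2, "關注", 17, "advcl"),
    (3, "數位", 4, "compound"),
    (4, "部", 5, "nmod"),
    (5, "部長", 13, "nmod"),
    (6, "唐", 5, "appos"),
    (7, "鳳", 6, "flat:name"),
    (8, "對", 13, "case"),
    (9, "數位", 13, "nmod"),
    (10, "中介", 13, "nmod"),
    (11, "服務", 12, "compound"),
    (12, "法", 13, "nmod"),
    (13, "看法", 17, "obl"),
]


def dependents_by_head(rows):
    """Map each head id to the list of (id, word, deprel) attached to it."""
    children = defaultdict(list)
    for word_id, word, head_id, deprel in rows:
        children[head_id].append((word_id, word, deprel))
    return children


children = dependents_by_head(ROWS)
# List every word that attaches directly to 看法 (id 13).
for word_id, word, deprel in children[13]:
    print(f"看法 <-{deprel}- {word} (id {word_id})")
```

Grouping by head this way makes it easy to see, for example, which words depend directly on 看法 and with which relation, which is the kind of check the opinion-sentence analysis will need.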

Features

Features in UD Chinese GSD were assigned by a program during annotation, with some manual corrections but not full manual verification.
Feature inventory:
Aspect – Case – Number – NumType – PartType – Person – Polarity – Voice

The feature annotations are described below:

Nominal Features
- Number
  - Plur
    - NOUN: 人們
    - PART: 們
    - PRON: 他們, 它們, 我們, 牠們, 她們
- Case
  - Gen
    - ADP: 之外
    - PART: 的, 之, 地
Degree and Polarity
- Polarity
  - Neg
    - ADV: 不, 未, 沒, 別, 無
Verbal Features
- Aspect
  - Perf
    - AUX: 了, 過
    - PART: 了
  - Prog
    - AUX: 著
- Voice
  - Cau
    - ADP: 以
    - VERB: 以, 使, 讓, 使得, 令, 導致, 要求, 派, 派遣, 任命
  - Pass
    - AUX: 為
    - VERB: 被, 為
Pronouns, Determiners, Quantifiers
- NumType
  - Card
    - NUM: 一, 兩, 三, 1, 3, 12, 5, 2, 8, 10
  - Ord
    - NUM: 第一, 第二, 第三, 首次, 第四, 第五, 第1, 第六, 第七, 首位
- Person
  - 1
    - PRON: 我, 我們
  - 2
    - PRON: 你, 妳, 您
  - 3
    - PRON: 他, 其, 她, 它, 他們, 它們, 牠們, 她們, 牠, 祂
Other Features
- PartType
  - Int
    - PART: 呢, 嗎, 啊
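
In CoNLL-U, and in the `feats` attribute of a Stanza word, these features appear as a single string of `Name=Value` pairs joined by `|`, for example `Number=Plur|Person=1`. As far as I know, a word without features has `feats` set to `None` in Stanza and `_` in CoNLL-U. The sketch below turns such a string into a dictionary; `parse_feats` is an illustrative helper name, not part of Stanza.

```python
def parse_feats(feats):
    """Turn 'Number=Plur|Person=1' into {'Number': 'Plur', 'Person': '1'}."""
    if not feats or feats == "_":  # None, empty string, or the CoNLL-U placeholder
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# 我們 would be expected to carry both a Number and a Person feature.
print(parse_feats("Number=Plur|Person=1"))
```

A dictionary makes it straightforward to filter words by one feature, such as `parse_feats(word.feats).get("Polarity") == "Neg"` for negation adverbs.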